Dataset description

The dataset contains the following information on car properties:

data("mtcars")  # load the dataset (ships with base R)
head(mtcars)    # preview the first six rows
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

The dataset contains 32 unique observations, each representing one car model. Because the dataset is this small, conclusions drawn from it will not be robust.
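The size and uniqueness claims can be checked directly. The following is a minimal sketch using base R only; the variable names are illustrative and not part of the original analysis.

```r
# mtcars ships with base R (package datasets)
data("mtcars")

n_rows <- nrow(mtcars)                            # number of car models
n_cols <- ncol(mtcars)                            # number of recorded properties
n_dup_names <- sum(duplicated(rownames(mtcars)))  # repeated model names, if any

cat(n_rows, "models,", n_cols, "variables,", n_dup_names, "duplicated names\n")
```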

Introductory graphical exploration

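#distribution plots
library(ggplot2) # ggplot(), geom_histogram(), facet_wrap()
library(reshape) # melt.data.frame()
ggplot(melt.data.frame(mtcars), aes(value)) +
  geom_histogram(bins = 15, aes(y = after_stat(density))) + geom_density() +
  facet_wrap(~variable, scales = "free")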
## Using  as id variables

Fig. 1. Distribution plots of dataset variables (exploratory analysis)
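If the reshape package is not available, the same long format can be produced with base R. This is only a sketch of an alternative, assuming ggplot2 3.3.0 or newer for `after_stat()`; `long` and `p` are illustrative names.

```r
library(ggplot2)

# stack() turns the all-numeric data frame into long format:
# column `values` holds the measurements, column `ind` the variable name
long <- stack(mtcars)

p <- ggplot(long, aes(x = values)) +
  geom_histogram(aes(y = after_stat(density)), bins = 15) +
  geom_density() +
  facet_wrap(~ind, scales = "free")
print(p)
```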

The distributions of the variables are shown in Fig. 1. Based on Fig. 1, the variables cyl, gear and carb are ordinal, while vs and am are nominal. The nominal variables in particular need special care during clustering and pattern identification. To look for possible linear correlations between pairs of variables, a correlation matrix was computed; it is presented below in Tab. 1.
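One possible way to record these variable types in the data itself is to convert the columns to factors. The sketch below works on a copy called `mt_typed` (a name introduced here for illustration); the rest of this analysis keeps using the numeric `mtcars`.

```r
mt_typed <- mtcars  # copy, so the numeric original stays untouched

ordinal_vars <- c("cyl", "gear", "carb")
nominal_vars <- c("vs", "am")

# ordered factors for ordinal variables, plain factors for nominal ones
mt_typed[ordinal_vars] <- lapply(mt_typed[ordinal_vars], factor, ordered = TRUE)
mt_typed[nominal_vars] <- lapply(mt_typed[nominal_vars], factor)

str(mt_typed[c(ordinal_vars, nominal_vars)])
```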

nc <- ncol(mtcars) #number of dataframe columns
hdr <- colnames(mtcars) # header (dataframe column names)

#correlation matrix
library(gdata) # provides lowerTriangle()
C <- round(cor(mtcars), 2) # correlation matrix, rounded to two decimals
lowerTriangle(C) <- NA # blank the redundant lower triangle for readability
print(C)
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
## mpg    1 -0.85 -0.85 -0.78  0.68 -0.87  0.42  0.66  0.60  0.48 -0.55
## cyl   NA  1.00  0.90  0.83 -0.70  0.78 -0.59 -0.81 -0.52 -0.49  0.53
## disp  NA    NA  1.00  0.79 -0.71  0.89 -0.43 -0.71 -0.59 -0.56  0.39
## hp    NA    NA    NA  1.00 -0.45  0.66 -0.71 -0.72 -0.24 -0.13  0.75
## drat  NA    NA    NA    NA  1.00 -0.71  0.09  0.44  0.71  0.70 -0.09
## wt    NA    NA    NA    NA    NA  1.00 -0.17 -0.55 -0.69 -0.58  0.43
## qsec  NA    NA    NA    NA    NA    NA  1.00  0.74 -0.23 -0.21 -0.66
## vs    NA    NA    NA    NA    NA    NA    NA  1.00  0.17  0.21 -0.57
## am    NA    NA    NA    NA    NA    NA    NA    NA  1.00  0.79  0.06
## gear  NA    NA    NA    NA    NA    NA    NA    NA    NA  1.00  0.27
## carb  NA    NA    NA    NA    NA    NA    NA    NA    NA    NA  1.00
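The lower triangle can also be blanked without the gdata package. A minimal base R sketch, using the illustrative name `C_base` so that it does not overwrite `C`:

```r
C_base <- round(cor(mtcars), 2)

# lower.tri() returns a logical mask of the cells below the diagonal
C_base[lower.tri(C_base)] <- NA

print(C_base)
```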

Fig. 2 shows scatter plots of all pairs of variables. Further graphical exploration of the data was carried out in order to choose the variables that are relevant for this exercise.

pairs(mtcars)

Fig. 2. Scatter plots of dataset variables (exploratory analysis)

Selecting the pairs of variables whose absolute correlation exceeds a chosen threshold.

interesting_correlation_level <- 0.6
vlist <- list() # names of the selected variable pairs
plist <- list() # one scatter plot per selected pair
k <- 0
for (i in seq_len(nc - 1)) # the last column has no partner to its right
{
  for (j in (i + 1):nc) {
     if (abs(C[i, j]) > interesting_correlation_level)
     {
       k <- k + 1
       vlist[[k]] <- c(hdr[i], hdr[j])
       # aes() is evaluated lazily, so local() freezes the column names;
       # .data[[name]] maps a column by name instead of passing a data frame
       plist[[k]] <- local({
         xv <- hdr[i]
         yv <- hdr[j]
         ggplot(mtcars, aes(x = .data[[xv]], y = .data[[yv]])) +
           geom_point() + xlab(xv) + ylab(yv)
       })
     }
  }
}

length(vlist)
## [1] 26
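The same selection can be written without explicit loops. The sketch below is an alternative formulation, not the code used for the plots; `C_full`, `hits` and `pair_names` are names introduced here. Sorting by row and then column reproduces the order in which the loop fills `vlist`.

```r
C_full <- round(cor(mtcars), 2)
threshold <- 0.6

# positions above the diagonal whose absolute correlation exceeds the threshold
hits <- which(upper.tri(C_full) & abs(C_full) > threshold, arr.ind = TRUE)
hits <- hits[order(hits[, "row"], hits[, "col"]), ]

pair_names <- data.frame(x = colnames(C_full)[hits[, "row"]],
                         y = colnames(C_full)[hits[, "col"]],
                         r = C_full[hits],
                         stringsAsFactors = FALSE)
nrow(pair_names)
```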

Plotting the selected pairs

multiplot(plotlist = plist[1:9], cols = 3) # the 26 plots are split into batches of nine by hand
## Loading required package: grid

multiplot(plotlist = plist[10:18], cols = 3)

multiplot(plotlist = plist[19:26], cols = 3)

linvlist <- c(1,2,4,5,7,8,10,12,13,14,16,17,19,20,22,25) #list of potentially interesting linear correlations (chosen manually)
multiplot(plotlist = plist[linvlist], cols = 4) #multiplot definition in different source file

Creating regression models

The models are specified manually so that the direction of each relation stays intuitive. For example, it is more reasonable that cyl influences mpg than the other way round.

  lmodel <- lm(data=mtcars,mpg~cyl) # lm() fits a type I (ordinary least squares) regression
  ggplot(data=mtcars,aes(x=cyl,y=mpg)) +geom_point() + geom_abline(slope = lmodel$coefficients[2], intercept = lmodel$coefficients[1])

  lmodel <- lm(data=mtcars,mpg~disp) 
  ggplot(data=mtcars,aes(x=disp,y=mpg)) +geom_point() + geom_abline(slope = lmodel$coefficients[2], intercept = lmodel$coefficients[1])

  lmodel <- lm(data=mtcars,mpg~drat) 
  ggplot(data=mtcars,aes(x=drat,y=mpg)) +geom_point() + geom_abline(slope = lmodel$coefficients[2], intercept = lmodel$coefficients[1])

  lmodel <- lm(data=mtcars,mpg~wt) 
  ggplot(data=mtcars,aes(x=wt,y=mpg)) +geom_point() + geom_abline(slope = lmodel$coefficients[2], intercept = lmodel$coefficients[1])

  lmodel <- lm(data=mtcars,disp~cyl) 
  ggplot(data=mtcars,aes(x=cyl,y=disp)) +geom_point() + geom_abline(slope = lmodel$coefficients[2], intercept = lmodel$coefficients[1])

  lmodel <- lm(data=mtcars,hp~cyl)
  ggplot(data=mtcars,aes(x=cyl,y=hp)) +geom_point() + geom_abline(slope = lmodel$coefficients[2], intercept = lmodel$coefficients[1])

  lmodel <- lm(data=mtcars,wt~cyl) 
  ggplot(data=mtcars,aes(x=cyl,y=wt)) +geom_point() + geom_abline(slope = lmodel$coefficients[2], intercept = lmodel$coefficients[1])

  lmodel <- lm(data=mtcars,hp~disp) 
  ggplot(data=mtcars,aes(x=disp,y=hp)) +geom_point() + geom_abline(slope = lmodel$coefficients[2], intercept = lmodel$coefficients[1])

  lmodel <- lm(data=mtcars,drat~disp) 
  ggplot(data=mtcars,aes(x=disp,y=drat)) +geom_point() + geom_abline(slope = lmodel$coefficients[2], intercept = lmodel$coefficients[1])

  lmodel <- lm(data=mtcars,wt~disp) 
  ggplot(data=mtcars,aes(x=disp,y=wt)) +geom_point() + geom_abline(slope = lmodel$coefficients[2], intercept = lmodel$coefficients[1])

  lmodel <- lm(data=mtcars,hp~wt) 
  ggplot(data=mtcars,aes(x=wt,y=hp)) +geom_point() + geom_abline(slope = lmodel$coefficients[2], intercept = lmodel$coefficients[1])

  lmodel <- lm(data=mtcars,qsec~hp) 
  ggplot(data=mtcars,aes(x=hp,y=qsec)) +geom_point() + geom_abline(slope = lmodel$coefficients[2], intercept = lmodel$coefficients[1])

  lmodel <- lm(data=mtcars,hp~carb) 
  ggplot(data=mtcars,aes(x=carb,y=hp)) +geom_point() + geom_abline(slope = lmodel$coefficients[2], intercept = lmodel$coefficients[1])

  lmodel <- lm(data=mtcars,drat~wt) 
  ggplot(data=mtcars,aes(x=wt,y=drat)) +geom_point() + geom_abline(slope = lmodel$coefficients[2], intercept = lmodel$coefficients[1])

  lmodel <- lm(data=mtcars,drat~gear) 
  ggplot(data=mtcars,aes(x=gear,y=drat)) +geom_point() + geom_abline(slope = lmodel$coefficients[2], intercept = lmodel$coefficients[1])

  lmodel <- lm(data=mtcars,qsec~carb) 
  ggplot(data=mtcars,aes(x=carb,y=qsec)) +geom_point() + geom_abline(slope = lmodel$coefficients[2], intercept = lmodel$coefficients[1])
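The repeated fit-and-plot pattern above could be wrapped in a small helper. This is a sketch only: `fit_and_plot`, `pairs_xy` and `fits` are hypothetical names that do not appear in the original analysis, and only three of the sixteen pairs are shown.

```r
library(ggplot2)

# Hypothetical helper: fit y ~ x by ordinary least squares and
# overlay the fitted line on a scatter plot of the two variables.
fit_and_plot <- function(x, y, data = mtcars) {
  model <- lm(reformulate(x, response = y), data = data)
  plot <- ggplot(data, aes(x = .data[[x]], y = .data[[y]])) +
    geom_point() +
    geom_abline(intercept = unname(coef(model)[1]),
                slope = unname(coef(model)[2]))
  list(model = model, plot = plot)
}

# each entry is c(predictor, response)
pairs_xy <- list(c("cyl", "mpg"), c("wt", "mpg"), c("disp", "hp"))
fits <- lapply(pairs_xy, function(p) fit_and_plot(p[1], p[2]))

# R squared of mpg ~ wt as a quick measure of fit quality
summary(fits[[2]]$model)$r.squared
```

Keeping the fitted models in a list, instead of overwriting `lmodel` each time, makes it possible to compare them afterwards.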

PCA

To search for possible correlation dependencies, PCA was performed on the continuous variables only (mpg, disp, hp, drat, wt, qsec). The ordinal variables (cyl, gear, carb) and the nominal variables (vs, am) were left out.

   library(FactoMineR) # provides PCA()
   gPCAreduced <- PCA(mtcars[c("mpg","disp","hp","drat","wt","qsec")], scale.unit = TRUE, ncp = 6, graph = TRUE)

As expected, the PCA plots only confirm the relations already found with the linear regressions.

Trying to linearly reduce the dataset dimensionality

dPCAreduced <- PCA(mtcars[c("mpg","disp","hp","drat","wt","qsec")], scale.unit = TRUE, ncp = 6, graph = FALSE)
dPCAreduced$eig
##        eigenvalue percentage of variance cumulative percentage of variance
## comp 1 4.18739648             69.7899413                          69.78994
## comp 2 1.14811212             19.1352020                          88.92514
## comp 3 0.33335666              5.5559444                          94.48109
## comp 4 0.15436054              2.5726757                          97.05376
## comp 5 0.12479601              2.0799335                          99.13370
## comp 6 0.05197818              0.8663031                         100.00000

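The six continuous variables can be reduced to 3 or 4 features: the first three components explain about 94.5% of the variance and the first four about 97.1%.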
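The eigenvalues above can be cross-checked with base R, since PCA on standardized variables amounts to an eigendecomposition of the correlation matrix. The sketch below uses `prcomp()` instead of FactoMineR; the 90% cut-off is an assumption made for illustration, not a value taken from the analysis.

```r
vars <- c("mpg", "disp", "hp", "drat", "wt", "qsec")

# centre and scale, so the squared sdev values are the eigenvalues
# of the correlation matrix
pc <- prcomp(mtcars[vars], center = TRUE, scale. = TRUE)
eigenvalues <- pc$sdev^2

cum_share <- cumsum(eigenvalues) / sum(eigenvalues)

# smallest number of components reaching the assumed 90% threshold
n_keep <- which(cum_share >= 0.90)[1]

round(cum_share, 3)
```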

TODO

check non-linear mappings

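library(kernlab) # kpca(), pcv()
library(plotly)  # plot_ly(), add_markers(), layout()

# kPCA with an RBF kernel; sigma = 0.2 is an assumed, untuned value
kpc <- kpca(~., data = mtcars, kernel = "rbfdot",
            kpar = list(sigma = 0.2), features = ncol(mtcars))

# extract the principal component vectors
PC <- pcv(kpc)

mtc <- mtcars

mtc$pc1 <- PC[, 1]
mtc$pc2 <- PC[, 2]
mtc$pc3 <- PC[, 3]

# 3D scatter plot of the first three kernel principal components
plot_ly(mtc, x = ~pc1, y = ~pc2, z = ~pc3) %>%
  add_markers() %>%
  layout(scene = list(xaxis = list(title = "PC1"),
                      yaxis = list(title = "PC2"),
                      zaxis = list(title = "PC3")))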

kPCA seems quite promising.
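Before relying on the 3D picture, it may help to look at how the kPCA eigenvalues are distributed. The sketch below repeats the same `kpca()` call under the illustrative name `kpc_check` and reads the eigenvalues with kernlab's `eig()` accessor. The shares are relative to the retained components only, so they are not directly comparable to the linear PCA percentages.

```r
library(kernlab)

# same assumed parameters as in the kPCA call above
kpc_check <- kpca(~., data = mtcars, kernel = "rbfdot",
                  kpar = list(sigma = 0.2), features = ncol(mtcars))

# eigenvalues of the retained kernel principal components
ev <- eig(kpc_check)

# share of each component among the retained ones
round(ev / sum(ev), 3)
```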